Data science is an interdisciplinary academic field that extracts or extrapolates knowledge and insights from noisy, structured, and unstructured data.
Why data science? There are many reasons, but I believe it promotes evidence-based, unbiased thinking.
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not. (Posit Primers: EDA)
Macroeconomics: the study of the entire economy of a region, a country, or the entire world
World Development Indicators (WDI) is the World Bank’s premier compilation of international statistics on global development. Drawing from officially recognized sources and including national, regional, and global estimates, the WDI provides access to approximately 1,600 indicators for 217 economies, with some time series extending back more than 50 years. The database helps users find information related to development, both current and historical. The topics covered in the WDI range from poverty, health, and demographics to GDP, trade, and the environment.
The World Development Indicators website provides access to data as well as information about data coverage, curation, and methodologies, and allows users to discover what type of indicators are available, how they are collected, and how they can be visualized to analyze development trends. (Wikipedia: Link)
Data Science and Artificial Intelligence
Computer Resources
Data
Big Data
Open and Public Data
Q. Explain ‘Open Data’ and ‘Public Data’, and their differences.
A. AI Poe Assistant. (Poe.com)
“Open data” and “public data” are related concepts that refer to different aspects of data accessibility and usage.
Open Data: Open data refers to the idea that certain data should be freely available to the public, without restrictions on access, usage, or redistribution. Open data is typically released in a machine-readable format, such as CSV, JSON, or XML, and is accompanied by an open license that allows anyone to use, reuse, and distribute the data for any purpose. The goal of open data is to promote transparency, collaboration, and innovation by enabling individuals, organizations, and communities to analyze and build upon the data.
Open data often includes government data, scientific research data, public records, and other datasets that are deemed to be of public interest. Governments, international organizations, and various institutions actively promote the release of open data to foster accountability, drive economic growth, and empower citizens to make informed decisions.
Public Data: Public data, on the other hand, refers to any data that is accessible or available to the public. It is a broader term that encompasses both open data and other types of data that may have certain restrictions on access or usage. Public data can include information that is freely and openly available, as well as data that is accessible through specific channels, such as public libraries, government agencies, or subscription-based services.
Public data can include a wide range of information, such as government reports, census data, legal documents, financial statements, and more. While public data may be accessible to the public, it does not necessarily guarantee the same level of openness, usability, or freedom as open data. Public data may have limitations on its usage, copyright restrictions, or conditions imposed by the data provider.
In summary, open data specifically refers to data that is freely accessible, usable, and redistributable without restrictions, often accompanied by an open license. Public data, on the other hand, is a broader term that includes both open data and other data that may have varying degrees of accessibility and usage restrictions.
Visit Open Government Data Toolkit.
Posit.cloud: https://posit.cloud/
Posit Primers: interactive tutorials
Cheatsheets: RStudio IDE, Data Visualization, Data Transformation, Data Import, RMarkdown, etc.
Shared Project: https://posit.cloud/content/5539763
Create Your Copy and Download it.
We use two packages, tidyverse and WDI.
R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardized collection format that can be installed by users of R, typically via a centralized software repository such as CRAN (the Comprehensive R Archive Network). [CRAN Link]
tidyverse: The tidyverse is a collection of open source packages for the R programming language introduced by Hadley Wickham and his team that “share an underlying design philosophy, grammar, and data structures” of tidy data. Characteristic features of tidyverse packages include extensive use of non-standard evaluation and encouraging piping. [CRAN Link]
WDI: Search and download data from over 40 databases hosted by the World Bank, including the World Development Indicators (‘WDI’), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national Poverty indicators. [CRAN Link]
Step 1. Install packages if necessary.
install.packages("tidyverse")
install.packages("WDI")
Step 2. Load packages.
library(tidyverse)
library(WDI)
Step 3. Create a data directory for the first time.
dir.create("data")
Step 4. (Recommended) Set the system language to English.
Sys.setenv(LANG = "en")
The following code chunk downloads GDP data using the indicator code below.
WDI indicator: NY.GDP.MKTP.PP.KD
df_gdp <- WDI(indicator = "NY.GDP.MKTP.PP.KD")
N.B. There are many GDP-related indicators in WDI, for example, “NY.GDP.MKTP.CD”.
To avoid downloading the same data repeatedly, save the data and reuse it.
CSV: comma-separated values, a plain-text format for tabular data.
write_csv(df_gdp, "data/gdp.csv")
Run the code above only once to download the data and write it into the data directory. After that, read the saved file instead.
df_gdp <- read_csv("data/gdp.csv")
Rows: 16758 Columns: 5
── Column specification ────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, iso2c, iso3c
dbl (2): year, NY.GDP.MKTP.PP.KD
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
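The download-once, read-many pattern described above can be sketched as a small helper. This is only an illustration: load_cached and fake_download are hypothetical names, the toy values are made up, and base R's write.csv/read.csv are used in place of write_csv/read_csv so that the sketch runs offline without any packages. In practice the fetch function would wrap the WDI() call.

```r
# Hypothetical helper: read the cached CSV if it exists,
# otherwise call fetch() once and save the result.
load_cached <- function(path, fetch) {
  if (!file.exists(path)) {
    dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
    write.csv(fetch(), path, row.names = FALSE)
  }
  read.csv(path)
}

# Stand-in for WDI(indicator = "NY.GDP.MKTP.PP.KD"); toy values only
calls <- 0
fake_download <- function() {
  calls <<- calls + 1  # count how many times we "download"
  data.frame(country = c("A", "B"), year = c(2021, 2022), gdp = c(1.5, 2.5))
}

path <- file.path(tempdir(), "data", "gdp_demo.csv")
unlink(path)  # start from a clean state for the demonstration

first  <- load_cached(path, fake_download)  # downloads and writes the file
second <- load_cached(path, fake_download)  # reads the saved file only
print(calls)
```

The second call never touches fake_download, which is the behavior you want when the real download is slow or rate-limited.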
First, look at the data using head, str (structure), and summary.
head: display the first 6 rows by default
head(df_gdp)
You can change the number of rows to output as follows.
Observe that the values before 1990 are missing (not available, NA).
head(df_gdp, 50)
2.561800e+12 is in scientific notation, i.e., \(2.561800 \times 10^{12} = 2,561,800,000,000\).
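If you prefer to see the digits rather than the exponent, base R can print the full number. A minimal sketch using format(); the variable names are only for illustration.

```r
x <- 2.561800e+12

# Print without scientific notation, with commas as thousands separators
x_str <- format(x, big.mark = ",", scientific = FALSE)
print(x_str)

# The same value written as mantissa times a power of ten
mantissa <- x / 10^12
print(mantissa)
```

Setting options(scipen = 999) has a similar effect on all subsequent printing in the session, if you would rather not call format() each time.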
str: display the structure of an object
str(df_gdp)
spc_tbl_ [16,758 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ country : chr [1:16758] "Africa Eastern and Southern" "Africa Eastern and Southern" "Africa Eastern and Southern" "Africa Eastern and Southern" ...
$ iso2c : chr [1:16758] "ZH" "ZH" "ZH" "ZH" ...
$ iso3c : chr [1:16758] "AFE" "AFE" "AFE" "AFE" ...
$ year : num [1:16758] 2022 2021 2020 2019 2018 ...
$ NY.GDP.MKTP.PP.KD: num [1:16758] 2.56e+12 2.47e+12 2.37e+12 2.43e+12 2.38e+12 ...
- attr(*, "spec")=
.. cols(
.. country = col_character(),
.. iso2c = col_character(),
.. iso3c = col_character(),
.. year = col_double(),
.. NY.GDP.MKTP.PP.KD = col_double()
.. )
- attr(*, "problems")=<externalptr>
summary: display the summary of an object
summary(df_gdp)
country iso2c iso3c year NY.GDP.MKTP.PP.KD
Length:16758 Length:16758 Length:16758 Min. :1960 Min. :2.482e+07
Class :character Class :character Class :character 1st Qu.:1975 1st Qu.:1.824e+10
Mode :character Mode :character Mode :character Median :1991 Median :1.055e+11
Mean :1991 Mean :3.329e+12
3rd Qu.:2007 3rd Qu.:1.083e+12
Max. :2022 Max. :1.390e+14
NA's :9096
In an R Notebook, the following also displays the first 1000 rows of the data in a paged format.
df_gdp
|>, or %>%, is called a pipe operator: df_gdp |> filter(country == COUNTRY) is the same as filter(df_gdp, country == COUNTRY).
filter : Keep rows that match a condition
COUNTRY <- "Japan"
df_gdp |> filter(country == COUNTRY)
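To see that the piped and the nested forms really give the same result, here is a small sketch on a toy table. The table toy and its values are made up and stand in for df_gdp; only dplyr is loaded.

```r
library(dplyr)

# Toy data (made-up values) standing in for df_gdp
toy <- tibble(
  country = c("Japan", "Japan", "World"),
  year    = c(2021, 2022, 2022),
  gdp     = c(5.1, 5.2, 139)
)

COUNTRY <- "Japan"

piped  <- toy |> filter(country == COUNTRY)  # left side becomes the first argument
nested <- filter(toy, country == COUNTRY)    # the same call written without a pipe

print(identical(piped, nested))
```

The pipe form reads left to right, which becomes more helpful as further steps such as drop_na() and ggplot() are chained on.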
ggplot() + geom_line(): tidyverse functions that draw a line graph
aes(year, NY.GDP.MKTP.PP.KD): aesthetic mapping that sends year to the x-axis and NY.GDP.MKTP.PP.KD to the y-axis
COUNTRY <- "Japan"
df_gdp |> filter(country == COUNTRY) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
Let’s delete the rows with missing values using drop_na(NY.GDP.MKTP.PP.KD), a transformation.
COUNTRY <- "Japan"
df_gdp |> filter(country == COUNTRY) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
COUNTRY <- "World"
df_gdp |> filter(country == COUNTRY) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
COUNTRY <- "World"
df_gdp |> filter(country == COUNTRY) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD)) + geom_line()
Observations and Questions
e.g. The GDP of the world has been increasing almost continuously since 1990.
e.g. There are drops in 2009 and 2020.
By country names
COUNTRIES <- c("Japan", "China", "India", "United Kingdom", "United States", "Germany", "France")
df_gdp |> filter(country %in% COUNTRIES) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) + geom_line()
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR") # IN = India (ID is Indonesia), GB = United Kingdom
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = iso2c)) + geom_line()
What happens if you replace color = iso2c at the bottom of the code above with colour = iso2c, color = country, or col = country?
ISO2C <- c("JP", "CN", "IN", "GB", "US", "DE", "FR") # IN = India (ID is Indonesia), GB = United Kingdom
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, colour = country)) + geom_line()
(df_codes <- df_gdp |> distinct(country, iso2c))
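A table of names and codes like df_codes is also handy for checking a vector of codes before plotting, because filter() silently ignores codes that do not exist. The sketch below uses a small made-up table (df_codes_toy and wanted are hypothetical names) and base R only. A common slip is writing "UK" where the ISO code is "GB".

```r
# Toy lookup table standing in for df_codes
df_codes_toy <- data.frame(
  country = c("India", "Indonesia", "United Kingdom", "Japan"),
  iso2c   = c("IN", "ID", "GB", "JP")
)

wanted <- c("JP", "IN", "UK")

# Codes that were requested but are not present in the data
not_found <- setdiff(wanted, df_codes_toy$iso2c)
print(not_found)

# Look up the codes by country name instead of typing them from memory
codes <- df_codes_toy$iso2c[df_codes_toy$country %in% c("India", "United Kingdom")]
print(codes)
```

An empty result from setdiff() means every requested code matches at least one row.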
Set COUNTRIES and/or ISO2C to draw line graphs of GDP.
BRICS <- c("Brazil", "Russian Federation", "India", "China", "South Africa") # surround the country name with quotation marks, and use a comma as a separator
df_gdp |> filter(country %in% BRICS) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = iso2c)) + geom_line()
ISO2C <- c("BR", "CN", "IN", "RU", "ZA") # surround the iso2c code with quotation marks, and use a comma as a separator
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) + geom_line()
ASEAN <- c("Brunei Darussalam", "Cambodia", "Indonesia", "Lao PDR", "Malaysia", "Myanmar", "Philippines", "Singapore", "Thailand", "Vietnam") # all ten ASEAN members; spellings must match the country column (check df_codes), e.g. Vietnam may appear as "Viet Nam"
df_gdp |> filter(country %in% ASEAN) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = iso2c)) + geom_line()
ISO2C <- c("BN", "ID", "KH", "LA", "MM", "MY", "PH", "SG", "TH", "VN") # all ten ASEAN members; surround the iso2c code with quotation marks, and use a comma as a separator
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) + geom_line()
ISO2C <- c("BR", "CN", "IN", "RU", "ZA") # surround the iso2c code with quotation marks, and use a comma as a separator
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) +
scale_color_manual(values = c("red", "orange", "yellow", "blue", "green")) +
geom_line()
ISO2C <- c("BR", "CN", "IN", "RU", "ZA") # surround the iso2c code with quotation marks, and use a comma as a separator
df_gdp |> filter(iso2c %in% ISO2C) |> drop_na(NY.GDP.MKTP.PP.KD) |>
ggplot(aes(year, NY.GDP.MKTP.PP.KD, color = country)) +
scale_color_manual(values = c("#999999", "#E69F00", "#56B4E9", "#FF0000", "#00FF00")) +
geom_line()
World Bank Home Page
Excel Files
API Search
WDIsearch(string = "gdp", field = "name")
WDIsearch(string = "NY.GDP.MKTP.PP.KD", field = "indicator", short = FALSE)
Find at least one WDI indicator with its name and its code.
Find at least one pair of WDI indicators, with their names and codes, whose relationship you want to study.
GDP, PPP (constant 2017 international $): NY.GDP.MKTP.PP.KD
Population, total: SP.POP.TOTL
Calculate GDP per Capita
GDP, PPP (constant 2017 international $): PPP GDP is gross domestic product converted to international dollars using purchasing power parity rates. An international dollar has the same purchasing power over GDP as the U.S. dollar has in the United States. GDP is the sum of gross value added by all resident producers in the country plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2017 international dollars. ID: NY.GDP.MKTP.PP.KD
Population, total: Total population is based on the de facto definition of population, which counts all residents regardless of legal status or citizenship. The values shown are midyear estimates. ID: SP.POP.TOTL
df_gdppcap <- WDI(indicator = c(gdp = "NY.GDP.MKTP.PP.KD", pop = "SP.POP.TOTL", gdppcap = "NY.GDP.PCAP.PP.KD"), extra = TRUE)
write_csv(df_gdppcap, "data/gdppcap.csv")
df_gdppcap <- read_csv("data/gdppcap.csv")
Rows: 16758 Columns: 15
── Column specification ────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (6): year, gdp, pop, gdppcap, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(df_gdppcap)
spc_tbl_ [16,758 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ country : chr [1:16758] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
$ iso2c : chr [1:16758] "AF" "AF" "AF" "AF" ...
$ iso3c : chr [1:16758] "AFG" "AFG" "AFG" "AFG" ...
$ year : num [1:16758] 2014 2012 2009 2013 1971 ...
$ status : logi [1:16758] NA NA NA NA NA NA ...
$ lastupdated: Date[1:16758], format: "2023-09-19" "2023-09-19" "2023-09-19" ...
$ gdp : num [1:16758] 7.02e+10 6.47e+10 4.99e+10 6.83e+10 NA ...
$ pop : num [1:16758] 32716210 30466479 27385307 31541209 11015857 ...
$ gdppcap : num [1:16758] 2144 2123 1824 2165 NA ...
$ region : chr [1:16758] "South Asia" "South Asia" "South Asia" "South Asia" ...
$ capital : chr [1:16758] "Kabul" "Kabul" "Kabul" "Kabul" ...
$ longitude : num [1:16758] 69.2 69.2 69.2 69.2 69.2 ...
$ latitude : num [1:16758] 34.5 34.5 34.5 34.5 34.5 ...
$ income : chr [1:16758] "Low income" "Low income" "Low income" "Low income" ...
$ lending : chr [1:16758] "IDA" "IDA" "IDA" "IDA" ...
- attr(*, "spec")=
.. cols(
.. country = col_character(),
.. iso2c = col_character(),
.. iso3c = col_character(),
.. year = col_double(),
.. status = col_logical(),
.. lastupdated = col_date(format = ""),
.. gdp = col_double(),
.. pop = col_double(),
.. gdppcap = col_double(),
.. region = col_character(),
.. capital = col_character(),
.. longitude = col_double(),
.. latitude = col_double(),
.. income = col_character(),
.. lending = col_character()
.. )
- attr(*, "problems")=<externalptr>
df_gdppcap |> select(region, income, lending) |> lapply(unique)
$region
[1] "South Asia" "Aggregates" "Europe & Central Asia"
[4] "Middle East & North Africa" "East Asia & Pacific" "Sub-Saharan Africa"
[7] "Latin America & Caribbean" "North America" NA
$income
[1] "Low income" "Aggregates" "Upper middle income" "Lower middle income"
[5] "High income" NA "Not classified"
$lending
[1] "IDA" "Aggregates" "IBRD" "Not classified" "Blend"
[6] NA
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
ggplot(aes(year, gdppcap)) + geom_line()
COUNTRY <- "World"
df_gdppcap |> filter(country == COUNTRY) |>
ggplot(aes(year, pop)) + geom_line()
Write your observations and questions.
e.g. GDP per capita data are not available before 1990, but population data are available from 1960.
df_gdppcap2 <- df_gdppcap |> drop_na(pop) |>
mutate(PCAP = gdp/pop, .after = gdppcap)
df_gdppcap2
df_gdppcap2 |> drop_na(gdppcap, PCAP) |> mutate(near = near(gdppcap, PCAP)) |>
summarize(numberofdata = n(), sum(near))
df_gdppcap2 |> filter(!near(gdppcap, PCAP))
df_gdppcap2 |> filter(!near(gdppcap, PCAP)) |> distinct(country) |> pull()
[1] "Cyprus" "Morocco" "Russian Federation" "Sudan"
[5] "Tanzania" "Ukraine"
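The comparison above uses near() rather than ==, because two numbers computed in different ways rarely agree to the last bit. near() treats values as equal when they differ by less than a small tolerance. A minimal sketch with made-up numbers (the toy table is not WDI data):

```r
library(dplyr)

# Exact comparison can fail because of floating-point rounding
exact_equal <- sqrt(2)^2 == 2
near_equal  <- near(sqrt(2)^2, 2)
print(c(exact_equal, near_equal))

# Toy table: the first reported value matches gdp / pop, the second does not
toy <- tibble(
  gdp     = c(6e10, 9e10),
  pop     = c(3e7, 4e7),
  gdppcap = c(2000, 2300)
) |>
  mutate(PCAP = gdp / pop, match = near(gdppcap, PCAP))

print(toy)
```

Rows where match is FALSE are the ones worth inspecting by hand, as was done for the six countries listed above.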
Write your observations and questions.
Q. In WDI of the World Bank, NY.GDP.MKTP.PP.KD/SP.POP.TOTL does not match up with NY.GDP.PCAP.PP.KD in several countries, “Cyprus”, “Morocco”, “Russian Federation”, “Sudan”, “Tanzania”, “Ukraine”. Why?
A. ChatGPT3.5.
The mismatch between NY.GDP.MKTP.PP.KD/SP.POP.TOTL and NY.GDP.PCAP.PP.KD in the World Bank’s World Development Indicators (WDI) for certain countries like Cyprus, Morocco, Russian Federation, Sudan, Tanzania, and Ukraine can occur due to several reasons:
Data sources and methodologies: The World Bank collects data from various sources, including national statistical agencies and international organizations. Different sources may use different methodologies or estimation techniques, leading to discrepancies in the reported values for GDP and population.
Availability and accuracy of data: Data availability and accuracy can vary across countries. Some countries may have more reliable and comprehensive data for GDP per capita, while data for total GDP or population may be less accurate or less frequently updated.
Purchasing Power Parity (PPP) adjustments: NY.GDP.MKTP.PP.KD/SP.POP.TOTL represents GDP per capita at purchasing power parity. PPP adjustments aim to account for differences in the cost of living across countries. However, estimating PPP values involves complex calculations and assumptions, which can introduce discrepancies between GDP per capita at PPP and GDP per capita without PPP adjustments.
Currency fluctuations and exchange rates: Currency fluctuations can affect the reported values for GDP per capita. NY.GDP.PCAP.PP.KD uses constant 2017 international dollars, while NY.GDP.MKTP.PP.KD/SP.POP.TOTL may use current or constant prices in national currencies. Exchange rate fluctuations can impact the conversion of national currencies to international dollars, resulting in discrepancies between the two indicators.
It is important to note that these discrepancies do not necessarily indicate errors in the data, but rather differences in the methodologies, sources, and adjustments used in the calculations. Users of the World Bank’s WDI should exercise caution and consider these factors when comparing and interpreting the data.
(HS: Sorry it does not say much.)
Two useful questions.
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
See Link.
arrange(desc(gdp)) reorders the rows in descending order of gdp; arrange(gdp) reorders them in ascending order.
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(gdp) |> arrange(desc(gdp))
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(gdppcap) |> arrange(desc(gdppcap))
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(gdppcap) |> arrange(gdppcap)
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(pop) |> arrange(desc(pop))
df_gdppcap |> filter(year == 2022, region != "Aggregates") |>
drop_na(pop) |> arrange(pop)
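One detail of filter(region != "Aggregates") is worth knowing: filter() keeps only rows where the condition is TRUE, so rows whose region is NA are dropped together with the aggregates. The sketch below shows this on a made-up table (toy and its rows are hypothetical) and how to keep the NA rows if you want them.

```r
library(dplyr)

# Toy data: one country, one aggregate, one row with unknown region
toy <- tibble(
  country = c("India", "World", "Somewhere"),
  region  = c("South Asia", "Aggregates", NA)
)

# NA != "Aggregates" evaluates to NA, so the third row is dropped as well
kept <- toy |> filter(region != "Aggregates")

# Keep rows with a missing region explicitly
kept_with_na <- toy |> filter(region != "Aggregates" | is.na(region))

print(kept)
print(kept_with_na)
```

For the rankings above, dropping rows with an unknown region is probably harmless, but it is good to know that it happens.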
What type of covariation occurs between my variables?
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp)) + geom_point()
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp)) + geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2022, region !="Aggregates") |>
drop_na(gdp, pop) |> lm(log10(gdp) ~ log10(pop), data = _) |> summary()
Call:
lm(formula = log10(gdp) ~ log10(pop), data = drop_na(filter(df_gdppcap2,
year == 2022, region != "Aggregates"), gdp, pop))
Residuals:
Min 1Q Median 3Q Max
-1.22646 -0.39512 0.03996 0.42553 0.95842
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.45320 0.27485 16.20 <2e-16 ***
log10(pop) 0.94704 0.03998 23.69 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5117 on 181 degrees of freedom
Multiple R-squared: 0.7561, Adjusted R-squared: 0.7548
F-statistic: 561.2 on 1 and 181 DF, p-value: < 2.2e-16
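The fitted line reads log10(gdp) = 4.45320 + 0.94704 × log10(pop), i.e. gdp ≈ 10^4.45320 × pop^0.94704. The sketch below uses the two coefficients copied from the summary above to make a prediction by hand. The population of 100 million is an arbitrary example, not a country in the data.

```r
# Coefficients copied from the lm() summary above
intercept <- 4.45320
slope     <- 0.94704

# Predicted GDP for a hypothetical country of 100 million people
pop_example <- 1e8
gdp_pred <- 10^(intercept + slope * log10(pop_example))

# Factor by which predicted GDP changes when population doubles
doubling_factor <- 2^slope

print(gdp_pred)         # on the order of 1e12 international dollars
print(doubling_factor)  # a little under 2, because the slope is below 1
```

With the fitted model object at hand, predict() would give the same number on the log10 scale. The residual standard error of about 0.51 in log10 units means individual countries can differ from this line by a factor of three or more.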
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |>
ggplot(aes(pop, gdp, color = region)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |>
drop_na(gdp, pop) |>
ggplot(aes(pop, gdp, color = region, shape = income)) + geom_point() +
scale_x_log10() + scale_y_log10()
df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |>
drop_na(gdp, gdppcap, pop) |>
ggplot(aes(gdppcap, gdp, color = region, size = pop)) + geom_point() +
scale_x_log10() + scale_y_log10()
install.packages("plotly")
library(plotly)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
test <- df_gdppcap2 |> filter(year == 2020, region !="Aggregates") |> drop_na(gdp, pop) |>
ggplot(aes(pop, gdp, color = country, shape = region)) + geom_point() +
scale_x_log10() + scale_y_log10() + theme(legend.position = "none")
test |> ggplotly()
Warning: The shape palette can deal with a maximum of 6 discrete values because more than 6
becomes difficult to discriminate; you have 7. Consider specifying shapes manually if you
must have them.